In this paper we present two datasets for Tamasheq, a developing language mainly spoken in Mali and Niger. The datasets were made available for the IWSLT 2022 low-resource speech translation track, and consist of radio recordings from the daily news broadcasts of Studio Kalangou (Niger) and Studio Tamani (Mali). We share (i) a large collection of unlabeled audio data (671 hours) in five languages: French from Niger, Fulfulde, Hausa, Tamasheq and Zarma, and (ii) a smaller parallel corpus of audio recordings (17 hours) in Tamasheq, with utterance-level translations into French. All of this data is shared under the Creative Commons BY-NC-ND 3.0 license. We hope these resources will inspire the speech community to develop and benchmark models using the Tamasheq language.
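In this paper, we combine two independent detection methods for identifying fake news: the algorithm VAGO uses semantic rules combined with NLP techniques to measure vagueness and subjectivity in texts, while the classifier FAKE-CLF relies on convolutional neural network classification and supervised deep learning to classify texts as biased or legitimate. We compare the results of the two methods on four corpora. We find a positive correlation between the vagueness and subjectivity measures obtained by VAGO and the classification of texts as biased by FAKE-CLF. The comparison yields mutual benefits: VAGO helps explain the results of FAKE-CLF, and conversely, FAKE-CLF helps us corroborate and expand VAGO's database. The use of two complementary techniques (rule-based vs. data-driven) proves useful for the challenging problem of identifying fake news.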